Abstract

Traditional algorithms for stochastic optimization require projecting the solution at each
iteration into a given domain to ensure its feasibility. When facing complex domains,
such as positive semi-definite cones, the projection operation can be expensive, leading
to a high computational cost per iteration. In this paper, we present a novel algorithm
that aims to reduce the number of projections for stochastic optimization. The proposed
algorithm combines the strength of several recent developments in stochastic optimization,
including mini-batch, extra-gradient, and epoch gradient descent, in order to effectively
exploit the smoothness and strong convexity. We show, both in expectation and with
high probability, that when the objective function is both smooth and strongly convex, the
proposed algorithm achieves the optimal $O(1/T)$ rate of convergence with only $O(\log T)$
projections. Our empirical study verifies the theoretical result.
1. Introduction

The goal of stochastic optimization is to solve the optimization problem

$$\min_{\mathbf{w} \in \mathcal{D}} F(\mathbf{w}),$$

using only the stochastic gradients of $F(\mathbf{w})$. In particular, we assume there exists a gradient
oracle, which for any point $\mathbf{w} \in \mathcal{D}$, returns a random vector $\mathbf{g}(\mathbf{w})$ that gives an unbiased
estimate of the subgradient of $F(\cdot)$ at $\mathbf{w}$. A special case of stochastic optimization is the
risk minimization problem, whose objective function is given by

$$F(\mathbf{w}) = \mathrm{E}_{(\mathbf{x},y)}\left[\ell(\mathbf{w}; (\mathbf{x}, y))\right],$$

where $(\mathbf{x}, y)$ is an instance-label pair, $\ell$ is a convex loss function that measures the prediction
error, and the expectation is taken over the unknown joint distribution of $(\mathbf{x}, y)$ (Zhang,
2004; Shalev-Shwartz et al., 2009; Hu et al., 2009). The performance of stochastic optimization
algorithms is typically characterized by the excess risk

$$F(\mathbf{w}_T) - \min_{\mathbf{w} \in \mathcal{D}} F(\mathbf{w}),$$

where $T$ is the number of iterations and $\mathbf{w}_T$ is the solution obtained after making $T$ calls
to the gradient oracle.

For general Lipschitz continuous convex functions, stochastic gradient descent exhibits
the unimprovable $O(1/\sqrt{T})$ rate of convergence (Nemirovski and Yudin, 1983; Nemirovski
et al., 2009). For strongly convex functions, the algorithms proposed in very recent works
(Juditsky and Nesterov, 2010; Hazan and Kale, 2011; Rakhlin et al., 2012; Chen et al., 2012)
achieve the optimal $O(1/T)$ rate (Agarwal et al., 2012). Although these convergence rates
are significantly worse than the results in deterministic optimization, stochastic optimization
is appealing due to its low per-iteration complexity. However, this is not the case when
the domain $\mathcal{D}$ is complex. This is because most stochastic optimization algorithms require
projecting the solution at each iteration into the domain $\mathcal{D}$ to ensure its feasibility, an
expensive operation when the domain is complex. In this paper, we show that if the objective
function is smooth and strongly convex, it is possible to reduce the number of projections
dramatically without affecting the convergence rate.

Our work is motivated by the difference in convergence rates between stochastic and
deterministic optimization. When the objective function is smooth and convex, under the
first-order oracle assumption, Nesterov's accelerated gradient method enjoys the optimal
$O(1/T^2)$ rate (Nesterov, 2004, 2005). Thus, for deterministic optimization of smooth and
convex functions, we can achieve an $O(1/\sqrt{T})$ rate by performing only $O(T^{1/4})$ updates.
When the objective function is smooth and strongly convex, the optimal rate for first-order
algorithms is $O(1/\alpha^T)$, for some constant $\alpha > 1$ (Nesterov, 2004, 2007). In other words,
for deterministic optimization of smooth and strongly convex functions, we can achieve an
$O(1/T)$ rate by performing only $O(\log T)$ updates. The above observations inspire us to
consider the following questions.

1. For Stochastic Optimization of Smooth and Convex functions (SOSC), is it possible
to maintain the optimal $O(1/\sqrt{T})$ rate by performing $O(T^{1/4})$ projections?

2. For Stochastic Optimization of Smooth and Strongly Convex functions (SOS$^2$C), is it
possible to maintain the optimal $O(1/T)$ rate by performing $O(\log T)$ projections?

For the first question, we have found a positive answer in the literature. By combining
mini-batches (Roux et al., 2008) with the accelerated stochastic approximation (Lan, 2012),
we can achieve the optimal $O(1/\sqrt{T})$ rate by performing $O(T^{1/4})$ projections (Cotter et al.,
2011). However, a naive application of mini-batches does not lead to the desired $O(\log T)$
complexity for SOS$^2$C. The main contribution of this paper is a novel stochastic optimization
algorithm that answers the second question positively. Our theoretical analysis reveals, both in
expectation and with high probability, that the proposed algorithm achieves the optimal
$O(1/T)$ rate by performing only $O(\log T)$ projections.
2. Related Work

In this section, we provide a brief review of the existing approaches for avoiding projections.

2.1. Mini-batch based algorithms

Instead of updating the solution after each call to the gradient oracle, mini-batch based
algorithms use the average gradient over multiple calls to update the solution (Roux et al.,
2008; Shalev-Shwartz et al., 2011; Dekel et al., 2011). For a fixed batch size $B$, the number
of updates (and projections) is reduced from $O(T)$ to $O(T/B)$, and the standard deviation of the
stochastic gradient is reduced from $\sigma$ to $\sigma/\sqrt{B}$. By appropriately balancing the
loss caused by a smaller number of updates against the reduction in the variance of the stochastic
gradients, these algorithms are able to maintain the optimal rate of convergence.
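
The variance-reduction effect of averaging can be checked numerically. The following is a minimal sketch under stated assumptions: the oracle `noisy_gradient` is a hypothetical toy (the gradient of $w^2/2$ plus uniform noise) and is not taken from the paper. It compares the empirical standard deviation of a single oracle call with that of an average over $B = 16$ calls, which should shrink by roughly $\sqrt{16} = 4$.

```python
import math
import random

def noisy_gradient(w, rng):
    """Hypothetical oracle: gradient of F(w) = w^2 / 2 plus uniform noise in [-1, 1]."""
    return w + rng.uniform(-1.0, 1.0)

def minibatch_gradient(w, batch_size, rng):
    """Average `batch_size` independent oracle calls at the same point."""
    return sum(noisy_gradient(w, rng) for _ in range(batch_size)) / batch_size

def empirical_std(samples):
    """Population standard deviation of a list of numbers."""
    mean = sum(samples) / len(samples)
    return math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))

rng = random.Random(0)
std_single = empirical_std([minibatch_gradient(1.0, 1, rng) for _ in range(4000)])
std_batch = empirical_std([minibatch_gradient(1.0, 16, rng) for _ in range(4000)])
ratio = std_single / std_batch  # should be close to sqrt(16) = 4
```

The price of the smaller noise is that each update now consumes 16 oracle calls, so only $T/16$ updates (and projections) fit into a budget of $T$ calls.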

The idea of mini-batches can be incorporated into any stochastic optimization algorithm
that uses gradient-based updating rules. When the objective function is smooth and convex,
combining mini-batches with the accelerated stochastic approximation (Lan, 2012) leads to an

$$O\left(\frac{B^2}{T^2} + \frac{1}{\sqrt{T}}\right)$$

rate of convergence (Cotter et al., 2011). By setting $B = T^{3/4}$, we achieve the optimal
$O(1/\sqrt{T})$ rate with only $O(T^{1/4})$ projections. When the target function is smooth and
strongly convex, we can apply mini-batches to the optimal algorithms for strongly convex
functions (Hu et al., 2009; Ghadimi and Lan, 2012), leading to an

$$O\left(\frac{B^2}{T^2} + \frac{1}{T}\right)$$

rate of convergence (Dekel et al., 2012). In order to maintain the optimal $O(1/T)$ rate, the
value of $B$ cannot be larger than $\sqrt{T}$, implying that at least $O(\sqrt{T})$ projections are required. In
contrast, the algorithm proposed in this paper achieves an $O(1/T)$ rate with only $O(\log T)$
projections.
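
To give a feeling for the gap between these regimes, the following sketch counts projections for a budget of $T = 10^6$ oracle calls. It is only an illustration: the helper name is hypothetical, and the last quantity shows the order of $\log T$ with all problem-dependent constants (such as the condition number) omitted.

```python
import math

def projections_fixed_batch(total_calls, batch_size):
    """One projection per update when every update consumes `batch_size` oracle calls."""
    return total_calls // batch_size

T = 10**6
sgd = projections_fixed_batch(T, 1)                    # plain SGD: O(T) projections
minibatch = projections_fixed_batch(T, math.isqrt(T))  # B = sqrt(T): O(sqrt(T)) projections
log_scale = math.ceil(math.log2(T))                    # order of log T, constants omitted
```

For this budget the three counts are 1,000,000, 1,000 and 20, respectively.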

2.2. Projection free algorithms 

Due to its low iteration cost, the Frank-Wolfe algorithm (Frank and Wolfe, 1956), also known as
the conditional gradient method (Levitin and Polyak, 1966), has seen a recent surge of interest
in machine learning (Hazan, 2008; Clarkson, 2010; Lacoste-Julien et al., 2013). At each
iteration of the Frank-Wolfe algorithm, instead of performing a projection that requires
solving a constrained quadratic programming problem, it solves a constrained linear
programming problem. For many domains of interest, including the positive semidefinite cone
and the trace norm ball, the constrained linear problem can be solved more efficiently than
a projection problem (Jaggi, 2013), making these methods attractive for large-scale
optimization.
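
The difference between a projection and a linear subproblem can be illustrated on a simpler domain. The sketch below is an assumption-laden toy, not an example from the paper: it uses the $\ell_1$ ball, where the linear minimization step reduces to picking one coordinate, and runs Frank-Wolfe with the standard step size $2/(t+2)$ on the quadratic $\frac{1}{2}\|\mathbf{w} - \mathbf{c}\|^2$.

```python
import math

def lmo_l1_ball(gradient, radius):
    """Linear minimization oracle: argmin of <gradient, s> over ||s||_1 <= radius."""
    i = max(range(len(gradient)), key=lambda j: abs(gradient[j]))
    s = [0.0] * len(gradient)
    if gradient[i] != 0.0:
        s[i] = -radius if gradient[i] > 0 else radius
    return s

def frank_wolfe(grad_fn, w, radius, steps):
    """Projection-free updates: move toward the vertex returned by the linear oracle."""
    for t in range(steps):
        s = lmo_l1_ball(grad_fn(w), radius)
        gamma = 2.0 / (t + 2)
        w = [(1 - gamma) * wi + gamma * si for wi, si in zip(w, s)]
    return w

target = [0.3, -0.2, 0.0]                      # minimizer, inside the unit l1 ball
grad = lambda w: [wi - ti for wi, ti in zip(w, target)]
w = frank_wolfe(grad, [0.0, 0.0, 0.0], radius=1.0, steps=200)
l1_norm = sum(abs(wi) for wi in w)
distance = math.sqrt(sum((wi - ti) ** 2 for wi, ti in zip(w, target)))
```

Every iterate is a convex combination of feasible points, so feasibility is maintained without ever calling a projection routine.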

In a recent work (Hazan and Kale, 2012), an online variant of the Frank-Wolfe algorithm
is proposed. Although the online Frank-Wolfe algorithm exhibits an $O(1/\sqrt{T})$ convergence
rate for smooth functions, it is unable to achieve the optimal $O(1/T)$ rate for strongly
convex functions. Besides, the memory complexity of this algorithm is $O(T)$, making it
unsuitable for large-scale optimization problems. Another related work is the stochastic
gradient descent with only one projection (Mahdavi et al., 2012). This algorithm is built
upon the assumption that the solution domain can be characterized by an inequality
constraint $g(\mathbf{w}) \le 0$ and that the gradient of $g(\cdot)$ can be evaluated efficiently. Unfortunately, this
assumption does not hold for some commonly used domains (e.g., the trace norm ball).
Compared to the projection-free algorithms, our proposed method is more general because
it makes no assumption about the solution domain.

3. Stochastic Optimization of Smooth and Strongly Convex Functions 
3.1. Preliminaries 

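We first define smoothness and strong convexity.

Definition 1 A function $f : \mathcal{D} \to \mathbb{R}$ is $L$-smooth w.r.t. a norm $\|\cdot\|$ if $f$ is everywhere
differentiable and

$$\|\nabla f(\mathbf{w}) - \nabla f(\mathbf{w}')\|_* \le L\|\mathbf{w} - \mathbf{w}'\|, \quad \forall \mathbf{w}, \mathbf{w}' \in \mathcal{D},$$

where $\|\cdot\|_*$ is the dual norm.

Definition 2 A smooth function $f : \mathcal{D} \to \mathbb{R}$ is $\lambda$-strongly convex w.r.t. a norm $\|\cdot\|$ if $f$
is everywhere differentiable and

$$\|\nabla f(\mathbf{w}) - \nabla f(\mathbf{w}')\|_* \ge \lambda\|\mathbf{w} - \mathbf{w}'\|, \quad \forall \mathbf{w}, \mathbf{w}' \in \mathcal{D}.$$

To simplify our analysis, we assume that both $\|\cdot\|$ and $\|\cdot\|_*$ are the vector $\ell_2$ norm in the
following discussion.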

Following (Hazan and Kale, 2011), we make the following assumptions about the gradient
oracle.

• There is a gradient oracle, which, for a given input point $\mathbf{w}$, returns a stochastic
gradient $\mathbf{g}(\mathbf{w})$ whose expectation is the gradient of $F(\mathbf{w})$ at $\mathbf{w}$, i.e.,

$$\mathrm{E}[\mathbf{g}(\mathbf{w})] = \nabla F(\mathbf{w}).$$

We further assume the stochastic gradients obtained by calling the oracle are
independent.

• The gradient oracle is $G$-bounded, i.e.,

$$\|\mathbf{g}(\mathbf{w})\| \le G, \quad \forall \mathbf{w} \in \mathcal{D}.$$

We note that this assumption may be relaxed by assuming the Orlicz norm of $\mathbf{g}(\mathbf{w})$ to
be bounded (Lan, 2012), i.e., $\mathrm{E}[\exp(\|\mathbf{g}(\mathbf{w})\|^2/G^2)] \le \exp(1)$. Although our theoretical
result holds even under the assumption of a bounded Orlicz norm, we choose the
$G$-bounded gradient for simplicity.

Define $\mathbf{w}_*$ as the optimal solution that minimizes $F(\mathbf{w})$, i.e., $\mathbf{w}_* = \arg\min_{\mathbf{w} \in \mathcal{D}} F(\mathbf{w})$.
Using the strong convexity of $F(\mathbf{w})$, we have (Hazan and Kale, 2011)

$$\frac{\lambda}{2}\|\mathbf{w} - \mathbf{w}_*\|^2 \le F(\mathbf{w}) - F(\mathbf{w}_*) \le \frac{2G^2}{\lambda}, \quad \forall \mathbf{w} \in \mathcal{D}. \tag{1}$$
3.2. The Algorithm

Algorithm 1 shows the proposed method for Stochastic Optimization of Smooth and Strongly
Convex functions (SOS$^2$C), which achieves the optimal $O(1/T)$ rate of convergence by
performing $O(\log T)$ projections. The inputs of the algorithm are: (1) $\eta$, the step size, (2) $M$,
the fixed number of updates per epoch/stage, (3) $B^1$, the initial batch size, and (4) $T$, the
total number of calls to the gradient oracle. With a slight abuse of notation, we use $\mathbf{g}(\mathbf{w}, i)$
to denote the stochastic gradient at $\mathbf{w}$ obtained after making the $i$-th call to the oracle. We
denote the projection of $\mathbf{w}$ onto the domain $\mathcal{D}$ by $\Pi_{\mathcal{D}}(\mathbf{w})$.

Similar to the epoch gradient descent algorithm (Hazan and Kale, 2011), the proposed
algorithm consists of two layers of loops. It uses the outer (while) loop to divide the learning
process into a sequence of epochs (Step 5 to Step 12). Similar to (Hazan and Kale, 2011), the
number of calls to the gradient oracle made by Algorithm 1 increases exponentially over the
epochs, a key property that allows us to achieve the optimal $O(1/T)$ convergence rate for strongly
convex functions. We note that other techniques, such as $\alpha$-suffix averaging (Rakhlin
et al., 2012), can also be used as an alternative.

In the inner (for) loop of each epoch, we combine the idea of mini-batches (Dekel et al.,
2011) with extra-gradient descent (Nemirovski, 2005; Juditsky et al., 2011). We choose
extra-gradient descent because it allows us to replace $\mathrm{E}[\|\mathbf{g}(\mathbf{w})\|^2]$ in the excess risk bound
with $\mathrm{E}[\|\mathbf{g}(\mathbf{w}) - \mathrm{E}[\mathbf{g}(\mathbf{w})]\|^2]$, the variance of the stochastic gradient $\mathbf{g}(\mathbf{w})$, thus opening the
door to fully exploiting the capacity of mini-batches in variance reduction.

To be more specific, in the $k$-th epoch, we maintain two sequences of solutions $\{\mathbf{w}_t^k\}_{t=1}^{M}$
and $\{\mathbf{z}_t^k\}_{t=1}^{M}$, where $\mathbf{z}_t^k$ is an auxiliary solution that allows us to effectively exploit the
smoothness of the loss function. At each iteration $t$ of the $k$-th epoch, we calculate the
average gradients $\hat{\mathbf{g}}_t^k$ and $\hat{\mathbf{f}}_t^k$ by calling the gradient oracle $B^k$ times each (Steps 6 and 8), and
update the solutions $\mathbf{z}_t^k$ and $\mathbf{w}_{t+1}^k$ using the average gradients (Steps 7 and 9). The batch
size $B^k$ is fixed inside each epoch but doubles from epoch to epoch (Step 11). This is in
contrast to most mini-batch based algorithms, which have a fixed batch size. This difference
is critical for achieving the $O(1/T)$ convergence rate with only $O(\log T)$ updates.

3.3. The main results 

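The following theorem bounds the expected excess risk of the solution returned by Algorithm 1
and the number of projections.

Theorem 1 Set the parameters in Algorithm 1 as

$$\eta = \frac{1}{\sqrt{6}L}, \quad M = \frac{4}{\eta\lambda} \quad \text{and} \quad B^1 = 12\eta\lambda.$$

The final point $\mathbf{w}_1^{k^\dagger+1}$ returned by Algorithm 1, where $k^\dagger$ denotes the index of the last epoch,
makes at most $T$ calls to the gradient oracle, and has its excess risk bounded by

$$\mathrm{E}\left[F(\mathbf{w}_1^{k^\dagger+1}) - F(\mathbf{w}_*)\right] \le \frac{384G^2}{\lambda T} = O\left(\frac{1}{T}\right),$$

and the total number of projections bounded by

$$\frac{8\sqrt{6}L}{\lambda}\left\lfloor \log_2\left(\frac{T}{96} + 1\right) \right\rfloor = O(\log T).$$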
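Algorithm 1 log T Projections for SOS$^2$C

1: Input: parameters $\eta$, $M$, $B^1$ and $T$
2: Initialize $\mathbf{w}_1^1 \in \mathcal{D}$ arbitrarily
3: Set $k = 1$
4: while $2M\sum_{i=1}^{k} B^i \le T$ do
5:    for $t = 1$ to $M$ do
6:       Compute the average gradient at $\mathbf{w}_t^k$ over $B^k$ calls to the gradient oracle
            $\hat{\mathbf{g}}_t^k = \frac{1}{B^k}\sum_{i=1}^{B^k} \mathbf{g}(\mathbf{w}_t^k, i)$
7:       Update
            $\mathbf{z}_t^k = \Pi_{\mathcal{D}}\left(\mathbf{w}_t^k - \eta\hat{\mathbf{g}}_t^k\right)$
8:       Compute the average gradient at $\mathbf{z}_t^k$ over $B^k$ calls to the gradient oracle
            $\hat{\mathbf{f}}_t^k = \frac{1}{B^k}\sum_{i=1}^{B^k} \mathbf{g}(\mathbf{z}_t^k, i)$
9:       Update
            $\mathbf{w}_{t+1}^k = \Pi_{\mathcal{D}}\left(\mathbf{w}_t^k - \eta\hat{\mathbf{f}}_t^k\right)$
10:   end for
11:   $\mathbf{w}_1^{k+1} = \frac{1}{M}\sum_{t=1}^{M} \mathbf{z}_t^k$ and $B^{k+1} = 2B^k$
12:   $k = k + 1$
13: end while
14: Return: $\mathbf{w}_1^k$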
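
The loop structure of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the domain is replaced by a Euclidean unit ball (`project_ball`) so that the projection is trivial, the toy objective $\frac{1}{2}\|\mathbf{w} - \mathbf{c}\|^2$ has $L = \lambda = 1$, the oracle adds uniform noise, and the real-valued parameters $M$ and $B^1$ are rounded up to integers.

```python
import math
import random

def average_gradient(oracle, w, batch_size):
    """Mini-batch average of `batch_size` independent oracle calls at w (Steps 6 and 8)."""
    total = [0.0] * len(w)
    for _ in range(batch_size):
        total = [s + g for s, g in zip(total, oracle(w))]
    return [s / batch_size for s in total]

def project_ball(w, radius=1.0):
    """Euclidean projection onto {w : ||w||_2 <= radius}, a cheap stand-in for the domain D."""
    norm = math.sqrt(sum(x * x for x in w))
    return list(w) if norm <= radius else [x * radius / norm for x in w]

def log_t_projections(oracle, project, w, eta, M, B1, T):
    """Epochs with a doubling batch size; each inner step is one extra-gradient update."""
    B, used, projections = B1, 0, 0
    while used + 2 * M * B <= T:              # equivalent to 2M * sum_{i<=k} B^i <= T
        z_sum = [0.0] * len(w)
        for _ in range(M):
            g_hat = average_gradient(oracle, w, B)                      # Step 6
            z = project([wi - eta * gi for wi, gi in zip(w, g_hat)])    # Step 7
            f_hat = average_gradient(oracle, z, B)                      # Step 8
            w = project([wi - eta * fi for wi, fi in zip(w, f_hat)])    # Step 9
            z_sum = [s + zi for s, zi in zip(z_sum, z)]
            projections += 2
        w = [s / M for s in z_sum]            # Step 11: restart from the average of the z's
        used += 2 * M * B
        B *= 2                                # Step 11: double the batch size
    return w, projections, used

# Toy problem (an assumption for illustration): F(w) = 0.5 * ||w - c||^2, so L = lambda = 1.
rng = random.Random(0)
c = [0.5, -0.25]
oracle = lambda w: [wi - ci + rng.uniform(-1.0, 1.0) for wi, ci in zip(w, c)]

eta = 1.0 / math.sqrt(6.0)                    # eta = 1 / (sqrt(6) L)
M = math.ceil(4.0 / eta)                      # M = 4 / (eta * lambda), rounded up
B1 = math.ceil(12.0 * eta)                    # B^1 = 12 * eta * lambda, rounded up
w, projections, used = log_t_projections(oracle, project_ball, [0.0, 0.0], eta, M, B1, T=20000)
error = math.sqrt(sum((wi - ci) ** 2 for wi, ci in zip(w, c)))
```

With these rounded values ($M = 10$, $B^1 = 5$) the run completes 7 epochs, uses 12,700 of the 20,000 allowed oracle calls and performs 140 projections; the eighth epoch would exceed the budget and is therefore skipped.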



Theorem 1 shows that, in expectation, Algorithm 1 achieves an $O(1/T)$ convergence rate with
$O(\log T)$ updates. The following theorem gives a high probability bound on the excess risk
for Algorithm 1.



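Theorem 2 Set the parameters in Algorithm 1 as

$$\eta = \frac{1}{\sqrt{6}L}, \quad M = \frac{4}{\eta\lambda} \quad \text{and} \quad B^1 = \alpha\eta\lambda,$$

where $\alpha$ is defined below. For any $0 < \delta < 1$, let

$$\tilde{\delta} = \frac{\delta}{k^\dagger},$$

$$k^\dagger = \left\lfloor \log_2\left(\frac{T}{8\alpha} + 1\right) \right\rfloor = O(\log T), \tag{2}$$

$$\alpha = \max\left\{ 400\log^2\frac{8M}{\tilde{\delta}},\; 1 + 64\log^2\frac{8M}{\tilde{\delta}}\left(\log\frac{4N}{\tilde{\delta}} + \frac{4}{9}\log^2\frac{4N}{\tilde{\delta}}\right) \right\} = O\left[(\log\log T + \log 1/\delta)^4\right], \tag{3}$$

$$N = \left\lceil \log_2\frac{4MT}{\eta\lambda} \right\rceil = O(\log T). \tag{4}$$

The final point $\mathbf{w}_1^{k^\dagger+1}$ returned by Algorithm 1 makes at most $T$ calls to the gradient oracle,
performs

$$2Mk^\dagger = \frac{8\sqrt{6}L}{\lambda}\left\lfloor \log_2\left(\frac{T}{8\alpha} + 1\right) \right\rfloor = O(\log T)$$

projections, and with a probability at least $1 - \delta$, has its excess risk bounded by

$$F(\mathbf{w}_1^{k^\dagger+1}) - F(\mathbf{w}_*) \le \frac{32\alpha G^2}{\lambda T} = O\left(\frac{(\log\log T + \log 1/\delta)^4}{T}\right).$$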

Remark: It is worth noting that we achieve the high probability bound without making
any modifications to Algorithm 1. This is in contrast to the epoch gradient descent
algorithm (Hazan and Kale, 2011), which needs to shrink the domain size in order to obtain the
desired high probability bound, and which could potentially incur an additional computational
cost in performing projections. We remove the shrinking step by effectively exploiting
the peeling technique (Bartlett et al., 2005).

The number of projections required by Algorithm 1, according to Theorem 2, exhibits
a linear dependence on the condition number $L/\lambda$, which can be very large when dealing
with ill-conditioned optimization problems. In the deterministic setting, the convergence
rate only depends on the square root of the condition number (Nesterov, 2004, 2007).
Thus, we conjecture that it may be possible to improve the dependence on the condition
number to its square root in the stochastic setting, a problem that will be examined in the
future.



4. Analysis 

We present here the proofs of the main theorems. The omitted proofs are provided in the
supplementary material.



4.1. Proof of Theorem 1 

Since we make use of the multi-stage learning strategy, the proof provided below is
similar to the proof in (Hazan and Kale, 2011). We begin by analyzing the property of the
inner loop in Algorithm 1, which is a combination of mini-batches and extra-gradient
descent. To this end, we have the following lemma.
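Lemma 3 Let $\eta = 1/[\sqrt{6}L]$ in Algorithm 1. Then, we have

$$F\left(\frac{1}{M}\sum_{t=1}^{M}\mathbf{z}_t^k\right) - F(\mathbf{w}_*) \le \frac{\|\mathbf{w}_1^k - \mathbf{w}_*\|^2}{2M\eta} + \frac{3\eta}{M}\left(\sum_{t=1}^{M}\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2 + \sum_{t=1}^{M}\|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2\right) \tag{5}$$

$$\qquad + \frac{1}{M}\sum_{t=1}^{M}\langle \mathbf{f}_t^k - \hat{\mathbf{f}}_t^k, \mathbf{z}_t^k - \mathbf{w}_* \rangle - \frac{\lambda}{2M}\sum_{t=1}^{M}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2, \tag{6}$$

where

$$\mathbf{g}_t^k = \nabla F(\mathbf{w}_t^k) \quad \text{and} \quad \mathbf{f}_t^k = \nabla F(\mathbf{z}_t^k).$$

Taking the conditional expectation of the inequality, we have

$$\mathrm{E}_{k-1}\left[F\left(\frac{1}{M}\sum_{t=1}^{M}\mathbf{z}_t^k\right)\right] - F(\mathbf{w}_*) \le \frac{\|\mathbf{w}_1^k - \mathbf{w}_*\|^2}{2M\eta} + \frac{6\eta G^2}{B^k},$$

where $\mathrm{E}_{k-1}[\cdot]$ denotes the expectation conditioned on all the randomness up to epoch $k - 1$.

The quantity in (5) illustrates the advantage of extra-gradient descent, i.e., it is able to
produce a variance-dependent upper bound when applied to stochastic optimization. Because
of mini-batches, the expectations of $\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2$ and $\|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2$ are smaller than $G^2/B^k$,
which leads to the tight upper bound in the second inequality.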

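Based on Lemma 3, we get the following lemma that bounds the expected excess risk
in each epoch.

Lemma 4 Define

$$\Delta_k = F(\mathbf{w}_1^k) - F(\mathbf{w}_*).$$

Set the parameters $\eta = 1/[\sqrt{6}L]$, $M = 4/[\eta\lambda]$ and $B^1 = 12\eta\lambda$ in Algorithm 1. For any $k$,
we have

$$\mathrm{E}[\Delta_k] \le V_k = \frac{G^2}{\lambda 2^{k-2}}.$$

Proof It is straightforward to check that

$$B^k = 12\eta\lambda 2^{k-1} = \frac{24\eta G^2}{V_k}. \tag{7}$$

We prove this lemma by induction on $k$. When $k = 1$, we know that

$$\Delta_1 = F(\mathbf{w}_1^1) - F(\mathbf{w}_*) \overset{(1)}{\le} \frac{2G^2}{\lambda} = \frac{G^2}{\lambda 2^{1-2}} = V_1.$$

Assume that $\mathrm{E}[\Delta_k] \le V_k$ for some $k \ge 1$, and we prove the inequality for $k + 1$. From
Lemma 3, we have

$$\mathrm{E}_{k-1}\left[F(\mathbf{w}_1^{k+1})\right] - F(\mathbf{w}_*) \le \frac{\|\mathbf{w}_1^k - \mathbf{w}_*\|^2}{2M\eta} + \frac{6\eta G^2}{B^k}.$$

Thus

$$\begin{aligned}
\mathrm{E}\left[F(\mathbf{w}_1^{k+1})\right] - F(\mathbf{w}_*)
&\le \frac{\mathrm{E}[\|\mathbf{w}_1^k - \mathbf{w}_*\|^2]}{2M\eta} + \frac{6\eta G^2}{B^k} \\
&\overset{(1)}{\le} \frac{\mathrm{E}[2(F(\mathbf{w}_1^k) - F(\mathbf{w}_*))/\lambda]}{2M\eta} + \frac{6\eta G^2}{B^k} \\
&\overset{(7)}{=} \frac{\mathrm{E}[\Delta_k]}{4} + \frac{V_k}{4} \le \frac{V_k}{2} = V_{k+1}.
\end{aligned}$$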
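
The arithmetic behind this induction is easy to verify numerically. The sketch below is only an illustrative check under the parameter choice $\eta = 1/[\sqrt{6}L]$, $M = 4/[\eta\lambda]$, $B^1 = 12\eta\lambda$; the function name and the sample values of $L$, $\lambda$ and $G$ are assumptions made for the example. It tabulates $V_k = G^2/(\lambda 2^{k-2})$ and $B^k = 12\eta\lambda 2^{k-1}$ and evaluates the right-hand side of the induction step with $\mathrm{E}[\Delta_k]$ replaced by its bound $V_k$.

```python
import math

def epoch_schedule(L, lam, G, num_epochs):
    """Tabulate (k, V_k, B^k, bound on E[Delta_{k+1}]) for the first `num_epochs` epochs."""
    eta = 1.0 / (math.sqrt(6.0) * L)
    M = 4.0 / (eta * lam)
    rows = []
    for k in range(1, num_epochs + 1):
        V_k = G ** 2 / (lam * 2.0 ** (k - 2))
        B_k = 12.0 * eta * lam * 2.0 ** (k - 1)
        # E||w_1^k - w_*||^2 <= 2 V_k / lam, so the first term is V_k / (lam * M * eta).
        next_bound = V_k / (lam * M * eta) + 6.0 * eta * G ** 2 / B_k
        rows.append((k, V_k, B_k, next_bound))
    return rows

L, lam, G = 10.0, 1.0, 2.0
eta = 1.0 / (math.sqrt(6.0) * L)
rows = epoch_schedule(L, lam, G, num_epochs=8)
```

Each row satisfies $B^k = 24\eta G^2/V_k$ and yields a bound of exactly $V_k/2 = V_{k+1}$ for the next epoch: the target accuracy halves while the batch size doubles.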



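We are now in a position to prove Theorem 1.
Proof [Proof of Theorem 1] From the stopping criterion of the outer loop in Algorithm 1,
we know that the number of epochs is given by the largest value of $k$ such that

$$2M\sum_{i=1}^{k} B^i \le T.$$

Since

$$2M\sum_{i=1}^{k} B^i = 24M\eta\lambda\sum_{i=1}^{k} 2^{i-1} = 96(2^k - 1),$$

the final epoch is given by

$$k^\dagger = \left\lfloor \log_2\left(\frac{T}{96} + 1\right) \right\rfloor,$$

and the final output is $\mathbf{w}_1^{k^\dagger+1}$. From Lemma 4, we have

$$\mathrm{E}\left[F(\mathbf{w}_1^{k^\dagger+1})\right] - F(\mathbf{w}_*) \le V_{k^\dagger+1} = \frac{G^2}{\lambda 2^{k^\dagger-1}} \le \frac{384G^2}{\lambda T},$$

where we use the fact

$$2^{k^\dagger} \ge \frac{1}{2}\left(\frac{T}{96} + 1\right) \ge \frac{T}{192}.$$

The total number of projections is

$$2Mk^\dagger = \frac{8\sqrt{6}L}{\lambda}\left\lfloor \log_2\left(\frac{T}{96} + 1\right) \right\rfloor.$$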
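
The counting argument in this proof can be reproduced directly. The following sketch (function name and sample inputs are illustrative assumptions) runs the stopping rule of the outer loop with the real-valued parameters of Theorem 1 and compares the resulting number of epochs with the closed form $\lfloor \log_2(T/96 + 1) \rfloor$; it treats $M$ and $B^k$ as real numbers, exactly as the analysis does.

```python
import math

def theorem1_budget(L, lam, G, T):
    """Epoch count, oracle calls, projections and risk bound under Theorem 1's parameters."""
    eta = 1.0 / (math.sqrt(6.0) * L)
    M = 4.0 / (eta * lam)
    B = 12.0 * eta * lam
    epochs, used = 0, 0.0
    while used + 2.0 * M * B <= T:      # stopping rule of the outer loop
        used += 2.0 * M * B             # each epoch costs 2 * M * B^k oracle calls
        B *= 2.0
        epochs += 1
    return {
        "epochs": epochs,
        "oracle_calls": used,
        "projections": 2.0 * M * epochs,
        "risk_bound": 384.0 * G ** 2 / (lam * T),
    }

stats = theorem1_budget(L=1.0, lam=1.0, G=1.0, T=10**6)
closed_form_epochs = math.floor(math.log2(10**6 / 96 + 1))
```

For $T = 10^6$ both computations give 13 epochs, using $96(2^{13} - 1) = 786{,}336$ oracle calls, so fewer than half of the oracle calls in the budget may remain unused in the worst case, which is where the factor 2 in $2^{k^\dagger} \ge T/192$ comes from.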
4.2. Proof of Theorem 2

Compared to the proof of Theorem 1, the main difference here is that we need a high
probability version of Lemma 3. Specifically, we need to provide high probability bounds
for the quantities in (5) and (6).

To bound the variances given in (5), we need the following norm concentration inequality
in Hilbert space (Smale and Zhou, 2009).

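Lemma 5 Let $\mathcal{H}$ be a Hilbert space and let $\xi$ be a random variable on $(\mathcal{Z}, \rho)$ with values
in $\mathcal{H}$. Assume $\|\xi\| \le B < \infty$ almost surely. Let $\{\xi_i\}_{i=1}^{m}$ be independent random draws of
$\rho$. For any $0 < \delta < 1$, with a probability at least $1 - \delta$,

$$\left\|\frac{1}{m}\sum_{i=1}^{m}\left[\xi_i - \mathrm{E}[\xi_i]\right]\right\| \le \frac{4B}{\sqrt{m}}\log\frac{2}{\delta}.$$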
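
A quick simulation illustrates how this bound behaves. The sketch below is an assumption-laden toy: it draws bounded random vectors in $\mathbb{R}^3$ with $\|\xi\| \le 1$ from a symmetric distribution (so that $\mathrm{E}[\xi] = 0$), and counts how often the norm of the sample mean exceeds $\frac{4B}{\sqrt{m}}\log\frac{2}{\delta}$ over repeated trials. The function names and the sampling scheme are chosen for illustration only.

```python
import math
import random

def centred_mean_norm(m, dim, rng):
    """Norm of the sample mean of m zero-mean random vectors with ||xi|| <= 1."""
    total = [0.0] * dim
    for _ in range(m):
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1.0:                  # rescale into the unit ball so that ||xi|| <= 1
            v = [x / norm for x in v]
        total = [s + x for s, x in zip(total, v)]
    return math.sqrt(sum((s / m) ** 2 for s in total))

def lemma5_bound(B, m, delta):
    """Right-hand side of the concentration inequality: (4B / sqrt(m)) * log(2 / delta)."""
    return 4.0 * B / math.sqrt(m) * math.log(2.0 / delta)

rng = random.Random(1)
m, delta = 400, 0.05
bound = lemma5_bound(1.0, m, delta)
violations = sum(centred_mean_norm(m, 3, rng) > bound for _ in range(200))
```

In this toy setting the bound is quite conservative: typical deviations are of order $1/\sqrt{m}$, well below the guaranteed level.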



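Based on Lemma 5, it is straightforward to prove the following lemma.
Lemma 6 With a probability at least $1 - \delta/2$, we have

$$\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\| \le \frac{4G}{\sqrt{B^k}}\log\frac{4M}{\delta}, \quad \forall t = 1, \ldots, M. \tag{8}$$

Similarly, with a probability at least $1 - \delta/4$, we have

$$\|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\| \le \frac{4G}{\sqrt{B^k}}\log\frac{8M}{\delta}, \quad \forall t = 1, \ldots, M. \tag{9}$$

We define the martingale difference sequence:

$$Z_t^k = \langle \mathbf{f}_t^k - \hat{\mathbf{f}}_t^k, \mathbf{z}_t^k - \mathbf{w}_* \rangle.$$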



In order to bound the summation of $Z_t^k$ in (6), we make use of the Bernstein inequality
for martingales (Cesa-Bianchi and Lugosi, 2006) and the peeling technique described
in (Bartlett et al., 2005), leading to the following lemma.

Lemma 7 We use $E_1$ to denote the event that all the inequalities in (9) hold. On event
$E_1$, with a probability at least $1 - \delta/4$, we have

$$\sum_{t=1}^{M} Z_t^k \le \frac{4G^2\eta M}{B^k}\log^2\frac{8M}{\delta} + \frac{G^2}{\lambda B^k}\left[1 + 64\log^2\frac{8M}{\delta}\left(\log\frac{4n}{\delta} + \frac{4}{9}\log^2\frac{4n}{\delta}\right)\right] + \frac{\lambda}{2}\sum_{t=1}^{M}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2,$$

where

$$n = \left\lceil \log_2\frac{4MB^k}{\eta\lambda} \right\rceil. \tag{10}$$
Substituting the results in Lemmas 6 and 7 into Lemma 3, we obtain the lemma below. 
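Lemma 8 For any $0 < \delta < 1$, with a probability at least $1 - \delta$, we have

$$F\left(\frac{1}{M}\sum_{t=1}^{M}\mathbf{z}_t^k\right) - F(\mathbf{w}_*) \le \frac{\|\mathbf{w}_1^k - \mathbf{w}_*\|^2}{2M\eta} + \frac{100G^2\eta}{B^k}\log^2\frac{8M}{\delta} + \frac{G^2}{\lambda B^k M}\left[1 + 64\log^2\frac{8M}{\delta}\left(\log\frac{4n}{\delta} + \frac{4}{9}\log^2\frac{4n}{\delta}\right)\right],$$

where $n$ is given in (10).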



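Based on Lemma 8, we provide a high probability version of Lemma 4, which bounds the
excess risk in each epoch with high probability.

Lemma 9 Set the parameters $\eta = 1/[\sqrt{6}L]$, $M = 4/[\eta\lambda]$ and $B^1 = \alpha\eta\lambda$ in Algorithm 1,
where $\alpha$ is defined in (3). For any $k$, with a probability at least $(1 - \tilde{\delta})^{k-1}$, we have

$$\Delta_k = F(\mathbf{w}_1^k) - F(\mathbf{w}_*) \le V_k = \frac{G^2}{\lambda 2^{k-2}}.$$

Proof We follow the logic used in the proof of Lemma 4.
It is straightforward to check that

$$B^k = \alpha\eta\lambda 2^{k-1} = \frac{2\alpha\eta G^2}{V_k}.$$

When $k = 1$, with a probability $(1 - \tilde{\delta})^{1-1} = 1$, we have

$$\Delta_1 = F(\mathbf{w}_1^1) - F(\mathbf{w}_*) \overset{(1)}{\le} \frac{2G^2}{\lambda} = \frac{G^2}{\lambda 2^{1-2}} = V_1.$$

Assume that with a probability at least $(1 - \tilde{\delta})^{k-1}$, $\Delta_k \le V_k$ for some $k \ge 1$. We now prove
the case for $k + 1$. Notice that $N$ defined in (4) is larger than $n$ defined in (10). From
Lemma 8, with a probability at least $1 - \tilde{\delta}$, we have

$$\begin{aligned}
\Delta_{k+1} &= F(\mathbf{w}_1^{k+1}) - F(\mathbf{w}_*) \\
&\le \frac{\|\mathbf{w}_1^k - \mathbf{w}_*\|^2}{2M\eta} + \frac{100G^2\eta}{B^k}\log^2\frac{8M}{\tilde{\delta}} + \frac{G^2}{\lambda B^k M}\left[1 + 64\log^2\frac{8M}{\tilde{\delta}}\left(\log\frac{4N}{\tilde{\delta}} + \frac{4}{9}\log^2\frac{4N}{\tilde{\delta}}\right)\right] \\
&\le \frac{\Delta_k}{4} + \frac{400}{\alpha}\log^2\frac{8M}{\tilde{\delta}}\,\frac{V_k}{8} + \frac{1}{\alpha}\left[1 + 64\log^2\frac{8M}{\tilde{\delta}}\left(\log\frac{4N}{\tilde{\delta}} + \frac{4}{9}\log^2\frac{4N}{\tilde{\delta}}\right)\right]\frac{V_k}{8}.
\end{aligned}$$

Using the definition of $\alpha$ in (3), with a probability at least $(1 - \tilde{\delta})^{k}$ we have

$$\Delta_{k+1} \le \frac{1}{4}V_k + \frac{1}{8}V_k + \frac{1}{8}V_k = \frac{1}{2}V_k = V_{k+1}.$$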

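Now, we provide the proof of Theorem 2.
Proof [Proof of Theorem 2] The number of epochs made is given by the largest value of $k$
satisfying $2M\sum_{i=1}^{k} B^i \le T$. Since

$$2M\sum_{i=1}^{k} B^i = 2M\alpha\lambda\eta\sum_{i=1}^{k} 2^{i-1} = 8\alpha(2^k - 1),$$

$k^\dagger$ defined in (2) is the value of the final epoch, and the final output is $\mathbf{w}_1^{k^\dagger+1}$. From
Lemma 9, we have with a probability at least $(1 - \tilde{\delta})^{k^\dagger}$

$$F(\mathbf{w}_1^{k^\dagger+1}) - F(\mathbf{w}_*) = \Delta_{k^\dagger+1} \le V_{k^\dagger+1} = \frac{G^2}{\lambda 2^{k^\dagger-1}} = \frac{2G^2}{\lambda 2^{k^\dagger}} \le \frac{32\alpha G^2}{\lambda T},$$

where we use the fact

$$2^{k^\dagger} \ge \frac{1}{2}\left(\frac{T}{8\alpha} + 1\right) \ge \frac{T}{16\alpha}.$$

We complete the proof by using the property that $(1 - \frac{1}{x})^x$ is an increasing function when
$x > 1$, which implies

$$(1 - \tilde{\delta})^{k^\dagger} = \left(1 - \frac{\delta}{k^\dagger}\right)^{k^\dagger} \ge 1 - \delta.$$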




5. Experiments 

In this section, we present numerical experiments to support our theoretical analysis. We
studied the following algorithms:

1. logT: the proposed algorithm, which is optimal for SOS$^2$C but only needs $O(\log T)$
projections;

2. EP_GD: the epoch gradient descent developed in (Hazan and Kale, 2011), which is
also optimal for SOS$^2$C but needs $O(T)$ projections;

3. SGD: the stochastic gradient descent with step size $\eta_t = 1/(\lambda t)$ (Shalev-Shwartz et al.,
2011), which achieves an $O(\log T/T)$ rate of convergence for general SOS$^2$C and needs
$O(T)$ projections.

We first consider a simple stochastic optimization problem adapted from (Rakhlin et al.,
2012), which is both smooth and strongly convex. The objective function is $F(W) = \frac{1}{2}\|W\|_F^2$
and the domain is the $5 \times 5$ dimensional positive semidefinite (PSD) cone. The stochastic
gradient oracle, given a point $W$, returns the stochastic gradient $W + Z$, where $Z$ is uniformly
distributed in $[-1, 1]^{5 \times 5}$. Because of the noise matrix $Z$, the intermediate solutions are in general not
PSD and we need to project them back to the PSD cone. To ensure that the eigendecomposition
involves only real numbers, we further require $Z$ to be symmetric. Notice that for this
problem we know $W_* = \arg\min_{W \succeq 0} F(W) = 0_{5 \times 5}$. Since the gradient at $W_*$ is $0_{5 \times 5}$,
it can be shown that SGD also achieves the optimal $O(1/T)$ rate of convergence on this
problem (Rakhlin et al., 2012).
(a) $(F(W_T) - F(W_*)) \times T$ versus $T$    (b) $F(W_T)$ versus the number of projections $P$

Figure 1: Results for stochastic optimization of $F(W) = \frac{1}{2}\|W\|_F^2$ over the PSD cone. The
experiments are repeated 10 times and the averages are reported.



Let $W_T$ be the solution returned after making $T$ calls to the gradient oracle. To verify whether
the proposed algorithm achieves an $O(1/T)$ convergence rate, we measure $(F(W_T) - F(W_*)) \times T$
versus $T$, which is given in Fig. 1(a). We observe that when $T$ is sufficiently large, the
quantity $(F(W_T) - F(W_*)) \times T$ essentially becomes a constant for all three algorithms,
implying $O(1/T)$ convergence rates for all the algorithms. We also observe that the constant
achieved by the proposed algorithm is slightly larger than those of the two competitors, which can
be attributed to the $\log\log T$-dependent factor in our bound in Theorem 2. To demonstrate the
advantage of our algorithm, we plot the value of the objective function versus the number
of projections $P$ in Fig. 1(b). We observe that, with our algorithm, the objective function
is reduced significantly faster than with the other algorithms w.r.t. the number of projections.

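In the second experiment, we apply our algorithm to the regularized distance metric
learning (Jin et al., 2009). The goal is to solve the following problem

$$\min_{M \succeq 0} \; \mathrm{E}_{(\mathbf{x}_i, y_i), (\mathbf{x}_j, y_j)}\left[\ell\left(y_{ij}\left(1 - \|\mathbf{x}_i - \mathbf{x}_j\|_M^2\right)\right)\right] + \frac{\lambda}{2}\|M\|_F^2,$$

where $\mathbf{x}_i$ is an instance and $y_i$ is its label, $y_{ij}$ is derived from the labels $y_i$ and $y_j$ (i.e., $y_{ij} = 1$
if $y_i = y_j$ and $-1$ otherwise), $\|\mathbf{x}\|_M^2 = \mathbf{x}^\top M \mathbf{x}$, and $\ell(z) = \log(1 + \exp(-z))$ is the logit loss.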
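
The ingredients of this loss are simple to compute. The sketch below is a minimal illustration, not the authors' code: it evaluates the logit loss in a numerically stable way, derives the pair label, and computes the squared distance $\mathbf{x}^\top M \mathbf{x}$ for a matrix given as nested lists. The margin form $y_{ij}(1 - \|\mathbf{x}_i - \mathbf{x}_j\|_M^2)$ used in `pair_loss` is an assumption based on the usual formulation of regularized distance metric learning, and all function names are hypothetical.

```python
import math

def logit_loss(z):
    """Numerically stable evaluation of log(1 + exp(-z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def pair_label(y_i, y_j):
    """y_ij = 1 if the two labels agree and -1 otherwise."""
    return 1 if y_i == y_j else -1

def squared_metric_distance(x_i, x_j, M):
    """||x_i - x_j||_M^2 = (x_i - x_j)^T M (x_i - x_j) for a square matrix M (nested lists)."""
    d = [a - b for a, b in zip(x_i, x_j)]
    return sum(d[r] * M[r][c] * d[c] for r in range(len(d)) for c in range(len(d)))

def pair_loss(x_i, y_i, x_j, y_j, M):
    """Loss of one training pair under the assumed margin y_ij * (1 - squared distance)."""
    margin = pair_label(y_i, y_j) * (1.0 - squared_metric_distance(x_i, x_j, M))
    return logit_loss(margin)

identity = [[1.0, 0.0], [0.0, 1.0]]
loss_same = pair_loss([1.0, 0.0], 0, [0.0, 0.0], 0, identity)   # unit distance, same label
loss_far = pair_loss([3.0, 0.0], 0, [0.0, 0.0], 1, identity)    # far apart, different labels
```

Under this form, pairs with the same label are penalized when they are far apart and pairs with different labels are penalized when they are close.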



During the optimization process, each call to the gradient oracle corresponds to generating a
training pair $\{(\mathbf{x}_i, y_i), (\mathbf{x}_j, y_j)\}$ randomly. To estimate the value of the objective function, we
evaluate the average empirical loss on 10^ testing pairs, which are also generated randomly. 
Fig. 2 shows the value of the objective function versus the number of projections P. Again, 
this result validates that the proposed algorithm log T is able to reduce the number of 
projections dramatically without hurting the performance. 

6. Conclusion 

In this paper, we study the problem of reducing the number of projections in stochastic
optimization by exploiting the property of smoothness. When the target function is smooth
(a) Mushrooms    (b) Adult

Figure 2: Results for the regularized distance metric learning on the Mushrooms and Adult
data sets. $F(W_T)$ is measured on 10^ testing pairs and the horizontal axis $P$
measures the number of projections performed by each algorithm. The experiments
are repeated 10 times and the averages are reported.



and strongly convex, we propose a novel algorithm that achieves the optimal $O(1/T)$ rate
of convergence by performing only $O(\log T)$ projections.

An open question is how to extend our results to stochastic composite optimization (Lan,
2012), where the objective function is a combination of non-smooth and smooth stochastic
components. We plan to explore the composite gradient mapping technique, introduced
in (Nesterov, 2007), to see if we can achieve an $O(1/T)$ convergence rate with only $O(\log T)$
projections.
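Appendix A. Proof of Lemma 3

We need the following lemma that characterizes the property of extra-gradient descent.

Lemma 10 (Lemma 3.1 in (Nemirovski, 2005)) Let $\mathcal{Z}$ be a convex compact set in Euclidean
space $\mathcal{E}$ with inner product $\langle \cdot, \cdot \rangle$, let $\|\cdot\|$ be a norm on $\mathcal{E}$ and $\|\cdot\|_*$ be its dual norm,
and let $\omega(\mathbf{z}) : \mathcal{Z} \to \mathbb{R}$ be an $\alpha$-strongly convex function with respect to $\|\cdot\|$. The Bregman
distance associated with $\omega$ for points $\mathbf{z}, \mathbf{w} \in \mathcal{Z}$ is defined as

$$B_\omega(\mathbf{z}, \mathbf{w}) = \omega(\mathbf{z}) - \omega(\mathbf{w}) - \langle \mathbf{z} - \mathbf{w}, \nabla\omega(\mathbf{w}) \rangle.$$

Let $U$ be a convex and closed subset of $\mathcal{Z}$, and let $\mathbf{z}_- \in \mathcal{Z}$, let $\xi, \eta \in \mathcal{E}$, and let $\gamma > 0$.
Consider the points

$$\mathbf{w} = \arg\min_{\mathbf{y} \in U}\left\{\langle \gamma\xi - \nabla\omega(\mathbf{z}_-), \mathbf{y} \rangle + \omega(\mathbf{y})\right\},$$

$$\mathbf{z}_+ = \arg\min_{\mathbf{y} \in U}\left\{\langle \gamma\eta - \nabla\omega(\mathbf{z}_-), \mathbf{y} \rangle + \omega(\mathbf{y})\right\}.$$

Then for all $\mathbf{z} \in U$ one has

$$\langle \mathbf{w} - \mathbf{z}, \gamma\eta \rangle \le B_\omega(\mathbf{z}, \mathbf{z}_-) - B_\omega(\mathbf{z}, \mathbf{z}_+) + \frac{\gamma^2}{\alpha}\|\eta - \xi\|_*^2 - \frac{\alpha}{2}\left\{\|\mathbf{w} - \mathbf{z}_-\|^2 + \|\mathbf{z}_+ - \mathbf{w}\|^2\right\}.$$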

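Proof [Proof of Lemma 3] We first state the inner loop in Algorithm 1 below.
for $t = 1$ to $M$ do
   Compute the average gradient at $\mathbf{w}_t^k$ over $B^k$ calls to the gradient oracle
      $\hat{\mathbf{g}}_t^k = \frac{1}{B^k}\sum_{i=1}^{B^k} \mathbf{g}(\mathbf{w}_t^k, i)$
   Update
      $\mathbf{z}_t^k = \Pi_{\mathcal{D}}\left(\mathbf{w}_t^k - \eta\hat{\mathbf{g}}_t^k\right)$
   Compute the average gradient at $\mathbf{z}_t^k$ over $B^k$ calls to the gradient oracle
      $\hat{\mathbf{f}}_t^k = \frac{1}{B^k}\sum_{i=1}^{B^k} \mathbf{g}(\mathbf{z}_t^k, i)$
   Update
      $\mathbf{w}_{t+1}^k = \Pi_{\mathcal{D}}\left(\mathbf{w}_t^k - \eta\hat{\mathbf{f}}_t^k\right)$
end for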
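To simplify the notation, we define

$$\mathbf{g}_t^k = \nabla F(\mathbf{w}_t^k) \quad \text{and} \quad \mathbf{f}_t^k = \nabla F(\mathbf{z}_t^k).$$

Let the two norms $\|\cdot\|$ and $\|\cdot\|_*$ in Lemma 10 be the vector $\ell_2$ norm. Each iteration in the
inner loop satisfies the conditions in Lemma 10 by doing the mappings below:

$$U = \mathcal{Z} = \mathcal{E} \leftarrow \mathcal{D}, \quad \omega(\mathbf{z}) \leftarrow \frac{1}{2}\|\mathbf{z}\|^2, \quad \alpha \leftarrow 1, \quad \gamma \leftarrow \eta,$$

$$\mathbf{z}_- \leftarrow \mathbf{w}_t^k, \quad \xi \leftarrow \hat{\mathbf{g}}_t^k, \quad \eta \leftarrow \hat{\mathbf{f}}_t^k, \quad \mathbf{w} \leftarrow \mathbf{z}_t^k, \quad \mathbf{z}_+ \leftarrow \mathbf{w}_{t+1}^k, \quad \mathbf{z} \leftarrow \mathbf{w}_*.$$

Following Lemma 10, we have

$$\begin{aligned}
& \langle \mathbf{z}_t^k - \mathbf{w}_*, \eta\hat{\mathbf{f}}_t^k \rangle \\
\le{} & \frac{\|\mathbf{w}_t^k - \mathbf{w}_*\|^2}{2} - \frac{\|\mathbf{w}_{t+1}^k - \mathbf{w}_*\|^2}{2} + \eta^2\|\hat{\mathbf{g}}_t^k - \hat{\mathbf{f}}_t^k\|^2 - \frac{1}{2}\left\{\|\mathbf{w}_t^k - \mathbf{z}_t^k\|^2 + \|\mathbf{z}_t^k - \mathbf{w}_{t+1}^k\|^2\right\} \\
\le{} & \frac{\|\mathbf{w}_t^k - \mathbf{w}_*\|^2}{2} - \frac{\|\mathbf{w}_{t+1}^k - \mathbf{w}_*\|^2}{2} + 3\eta^2\left(\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2 + \|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2 + \|\mathbf{g}_t^k - \mathbf{f}_t^k\|^2\right) - \frac{1}{2}\|\mathbf{w}_t^k - \mathbf{z}_t^k\|^2 \\
={} & \frac{\|\mathbf{w}_t^k - \mathbf{w}_*\|^2}{2} - \frac{\|\mathbf{w}_{t+1}^k - \mathbf{w}_*\|^2}{2} + 3\eta^2\left(\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2 + \|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2\right) + 3\eta^2\|\mathbf{g}_t^k - \mathbf{f}_t^k\|^2 - \frac{1}{2}\|\mathbf{w}_t^k - \mathbf{z}_t^k\|^2 \\
\le{} & \frac{\|\mathbf{w}_t^k - \mathbf{w}_*\|^2}{2} - \frac{\|\mathbf{w}_{t+1}^k - \mathbf{w}_*\|^2}{2} + 3\eta^2\left(\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2 + \|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2\right) + 3\eta^2L^2\|\mathbf{w}_t^k - \mathbf{z}_t^k\|^2 - \frac{1}{2}\|\mathbf{w}_t^k - \mathbf{z}_t^k\|^2 \\
={} & \frac{\|\mathbf{w}_t^k - \mathbf{w}_*\|^2}{2} - \frac{\|\mathbf{w}_{t+1}^k - \mathbf{w}_*\|^2}{2} + 3\eta^2\left(\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2 + \|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2\right),
\end{aligned} \tag{11}$$

where in the fifth line we use the smoothness assumption

$$\|\mathbf{g}_t^k - \mathbf{f}_t^k\| = \|\nabla F(\mathbf{w}_t^k) - \nabla F(\mathbf{z}_t^k)\| \le L\|\mathbf{w}_t^k - \mathbf{z}_t^k\|,$$

and the last line follows from $3\eta^2L^2 = 1/2$ when $\eta = 1/[\sqrt{6}L]$.
From the property of $\lambda$-strongly convex functions and (11), we obtain

$$\begin{aligned}
& F(\mathbf{z}_t^k) - F(\mathbf{w}_*) \\
\le{} & \langle \mathbf{f}_t^k, \mathbf{z}_t^k - \mathbf{w}_* \rangle - \frac{\lambda}{2}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2 \\
={} & \langle \hat{\mathbf{f}}_t^k, \mathbf{z}_t^k - \mathbf{w}_* \rangle + \langle \mathbf{f}_t^k - \hat{\mathbf{f}}_t^k, \mathbf{z}_t^k - \mathbf{w}_* \rangle - \frac{\lambda}{2}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2 \\
\le{} & \frac{\|\mathbf{w}_t^k - \mathbf{w}_*\|^2}{2\eta} - \frac{\|\mathbf{w}_{t+1}^k - \mathbf{w}_*\|^2}{2\eta} + 3\eta\left(\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2 + \|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2\right) + \langle \mathbf{f}_t^k - \hat{\mathbf{f}}_t^k, \mathbf{z}_t^k - \mathbf{w}_* \rangle - \frac{\lambda}{2}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2.
\end{aligned}$$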
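Summing up over all $t = 1, 2, \ldots, M$, we have

$$\begin{aligned}
\sum_{t=1}^{M} F(\mathbf{z}_t^k) - MF(\mathbf{w}_*)
\le{} & \frac{\|\mathbf{w}_1^k - \mathbf{w}_*\|^2}{2\eta} + 3\eta\left(\sum_{t=1}^{M}\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2 + \sum_{t=1}^{M}\|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2\right) \\
& + \sum_{t=1}^{M}\langle \mathbf{f}_t^k - \hat{\mathbf{f}}_t^k, \mathbf{z}_t^k - \mathbf{w}_* \rangle - \frac{\lambda}{2}\sum_{t=1}^{M}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2.
\end{aligned}$$

Dividing both sides by $M$ and following Jensen's inequality, we have

$$\begin{aligned}
F\left(\frac{1}{M}\sum_{t=1}^{M}\mathbf{z}_t^k\right) - F(\mathbf{w}_*)
\le{} & \frac{1}{M}\sum_{t=1}^{M} F(\mathbf{z}_t^k) - F(\mathbf{w}_*) \\
\le{} & \frac{\|\mathbf{w}_1^k - \mathbf{w}_*\|^2}{2M\eta} + \frac{3\eta}{M}\left(\sum_{t=1}^{M}\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2 + \sum_{t=1}^{M}\|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2\right) \\
& + \frac{1}{M}\sum_{t=1}^{M}\langle \mathbf{f}_t^k - \hat{\mathbf{f}}_t^k, \mathbf{z}_t^k - \mathbf{w}_* \rangle - \frac{\lambda}{2M}\sum_{t=1}^{M}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2,
\end{aligned} \tag{12}$$

which gives the first inequality in Lemma 3.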

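Let $\mathrm{E}_{k-1}[\cdot]$ denote the expectation conditioned on all the randomness up to epoch $k - 1$
and $\mathrm{E}_k^{t-1}[\cdot]$ denote the expectation conditioned on all the randomness up to the $(t-1)$-th
iteration in the $k$-th epoch. Taking the conditional expectation of (12), we have

$$\begin{aligned}
\mathrm{E}_{k-1}\left[F\left(\frac{1}{M}\sum_{t=1}^{M}\mathbf{z}_t^k\right)\right] - F(\mathbf{w}_*)
\le{} & \frac{\|\mathbf{w}_1^k - \mathbf{w}_*\|^2}{2M\eta} + \frac{3\eta}{M}\left(\mathrm{E}_{k-1}\left[\sum_{t=1}^{M}\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2\right] + \mathrm{E}_{k-1}\left[\sum_{t=1}^{M}\|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2\right]\right) \\
& + \frac{1}{M}\mathrm{E}_{k-1}\left[\sum_{t=1}^{M}\langle \mathbf{f}_t^k - \hat{\mathbf{f}}_t^k, \mathbf{z}_t^k - \mathbf{w}_* \rangle\right],
\end{aligned} \tag{13}$$

where we drop the last term, since it is negative. To bound $\mathrm{E}_{k-1}\left[\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2\right]$,
we have

$$\begin{aligned}
\mathrm{E}_{k-1}\left[\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\|^2\right]
&= \mathrm{E}_{k-1}\left[\left\|\frac{1}{B^k}\sum_{i=1}^{B^k}\mathbf{g}(\mathbf{w}_t^k, i) - \mathbf{g}_t^k\right\|^2\right] \\
&= \frac{1}{[B^k]^2}\mathrm{E}_{k-1}\left[\sum_{i=1}^{B^k}\|\mathbf{g}(\mathbf{w}_t^k, i) - \mathbf{g}_t^k\|^2 + \sum_{i \ne j}\mathrm{E}_k^{t-1}\left[\langle \mathbf{g}(\mathbf{w}_t^k, i) - \mathbf{g}_t^k, \mathbf{g}(\mathbf{w}_t^k, j) - \mathbf{g}_t^k \rangle\right]\right] \\
&= \frac{1}{[B^k]^2}\sum_{i=1}^{B^k}\mathrm{E}_{k-1}\left[\|\mathbf{g}(\mathbf{w}_t^k, i) - \mathbf{g}_t^k\|^2\right] \le \frac{G^2}{B^k},
\end{aligned} \tag{14}$$

where we make use of the facts that $\mathbf{g}(\mathbf{w}_t^k, i)$ and $\mathbf{g}(\mathbf{w}_t^k, j)$ are independent when $i \ne j$, and

$$\mathrm{E}_k^{t-1}\left[\mathbf{g}(\mathbf{w}_t^k, i) - \mathbf{g}_t^k\right] = 0, \quad \mathrm{E}_k^{t-1}\left[\|\mathbf{g}(\mathbf{w}_t^k, i) - \mathbf{g}_t^k\|^2\right] \le \mathrm{E}_k^{t-1}\left[\|\mathbf{g}(\mathbf{w}_t^k, i)\|^2\right] \le G^2, \quad \forall i = 1, \ldots, B^k.$$

Similarly, we also have

$$\mathrm{E}_{k-1}\left[\|\hat{\mathbf{f}}_t^k - \mathbf{f}_t^k\|^2\right] \le \frac{G^2}{B^k}. \tag{15}$$

Notice that $\hat{\mathbf{f}}_t^k$ is an unbiased estimate of $\mathbf{f}_t^k$, thus

$$\mathrm{E}_{k-1}\left[\langle \mathbf{f}_t^k - \hat{\mathbf{f}}_t^k, \mathbf{z}_t^k - \mathbf{w}_* \rangle\right] = \mathrm{E}_{k-1}\left[\langle \mathrm{E}_k^{t-1}[\mathbf{f}_t^k - \hat{\mathbf{f}}_t^k], \mathbf{z}_t^k - \mathbf{w}_* \rangle\right] = 0. \tag{16}$$

Substituting (14), (15), and (16) into (13), we get the second inequality in Lemma 3.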



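Appendix B. Proof of Lemma 6

Proof Recall that $\hat{\mathbf{g}}_t^k = \frac{1}{B^k}\sum_{i=1}^{B^k}\mathbf{g}(\mathbf{w}_t^k, i)$, thus

$$\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\| = \left\|\frac{1}{B^k}\sum_{i=1}^{B^k}\left[\mathbf{g}(\mathbf{w}_t^k, i) - \mathbf{g}_t^k\right]\right\|.$$

Since $\|\mathbf{g}(\mathbf{w}_t^k, i)\| \le G$ and $\mathrm{E}[\mathbf{g}(\mathbf{w}_t^k, i)] = \mathbf{g}_t^k$, by Lemma 5 we have, with a probability at least $1 - \delta'$,

$$\|\hat{\mathbf{g}}_t^k - \mathbf{g}_t^k\| \le \frac{4G}{\sqrt{B^k}}\log\frac{2}{\delta'}.$$

We obtain (8) by the union bound over $t = 1, \ldots, M$ and setting $\delta/2 = M\delta'$. The inequality in (9) can be
proved in the same way. ■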
Appendix C. Proof of Lemma 7

We first state the Bernstein inequality for martingales (Cesa-Bianchi and Lugosi, 2006),
which is used in the proof below.

Theorem 3 (Bernstein's inequality for martingales). Let $X_1, \ldots, X_n$ be a bounded martingale
difference sequence with respect to the filtration $\mathcal{F} = (\mathcal{F}_i)_{1 \le i \le n}$ and with $|X_i| \le K$.
Let

$$S_i = \sum_{j=1}^{i} X_j$$

be the associated martingale. Denote the sum of the conditional variances by

$$\Sigma_n^2 = \sum_{t=1}^{n}\mathrm{E}\left[X_t^2 \mid \mathcal{F}_{t-1}\right].$$

Then for all constants $t, \nu > 0$,

$$\Pr\left[\max_{i=1,\ldots,n} S_i > t \text{ and } \Sigma_n^2 \le \nu\right] \le \exp\left(-\frac{t^2}{2(\nu + Kt/3)}\right),$$

and therefore,

$$\Pr\left[\max_{i=1,\ldots,n} S_i > \sqrt{2\nu t} + \frac{\sqrt{2}}{3}Kt \text{ and } \Sigma_n^2 \le \nu\right] \le e^{-t}.$$

To simplify the notation, we define

$$A = \sum_{t=1}^{M}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2 \le \frac{4MG^2}{\lambda^2},$$

$$C = \frac{4G}{\sqrt{B^k}}\log\frac{8M}{\delta}.$$

In the analysis below, we consider two different scenarios, i.e., $A \le \eta G^2/[\lambda B^k]$ and $A >
\eta G^2/[\lambda B^k]$.

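C.1. $A \le \eta G^2/[\lambda B^k]$

On event $E_1$, we can bound

$$Z_t^k \le \|\mathbf{f}_t^k - \hat{\mathbf{f}}_t^k\|\,\|\mathbf{z}_t^k - \mathbf{w}_*\| \le \frac{\eta}{4}\|\mathbf{f}_t^k - \hat{\mathbf{f}}_t^k\|^2 + \frac{1}{\eta}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2 \le \frac{\eta}{4}C^2 + \frac{1}{\eta}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2.$$

Summing up over all $t = 1, 2, \ldots, M$,

$$\sum_{t=1}^{M} Z_t^k \le \frac{\eta MC^2}{4} + \frac{1}{\eta}\sum_{t=1}^{M}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2 \le \frac{\eta MC^2}{4} + \frac{G^2}{\lambda B^k}. \tag{17}$$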
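C.2. $A > \eta G^2/[\lambda B^k]$

Similar to the above proof, on event $E_1$, we bound

$$|Z_t^k| \le \|\mathbf{f}_t^k - \hat{\mathbf{f}}_t^k\|\,\|\mathbf{z}_t^k - \mathbf{w}_*\| \le \frac{1}{\theta}\|\mathbf{f}_t^k - \hat{\mathbf{f}}_t^k\|^2 + \frac{\theta}{4}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2 \le \frac{C^2}{\theta} + \frac{\theta}{4}A,$$

where $\theta$ can be any positive real number. Denote the sum of conditional variances by

$$\Sigma_M^2 = \sum_{t=1}^{M}\mathrm{E}_k^{t-1}\left[(Z_t^k)^2\right] \le C^2\sum_{t=1}^{M}\|\mathbf{z}_t^k - \mathbf{w}_*\|^2 = C^2A,$$

where $\mathrm{E}_k^{t-1}[\cdot]$ denotes the expectation conditioned on all the randomness up to the $(t-1)$-th
iteration in the $k$-th epoch.

Notice that $A$ in the upper bound for $|Z_t^k|$ and $\Sigma_M^2$ is a random variable, thus we cannot
directly apply Theorem 3. To address this challenge, we make use of the peeling technique
described in (Bartlett et al., 2005), and have

$$\begin{aligned}
& \Pr\left[\sum_{t=1}^{M} Z_t^k \ge 2\sqrt{C^2A\tau} + \frac{4}{3}\left(\frac{C^2}{\theta} + \frac{\theta A}{4}\right)\tau\right] \\
={} & \Pr\left[\sum_{t=1}^{M} Z_t^k \ge 2\sqrt{C^2A\tau} + \frac{4}{3}\left(\frac{C^2}{\theta} + \frac{\theta A}{4}\right)\tau,\; \max_t |Z_t^k| \le \frac{C^2}{\theta} + \frac{\theta A}{4},\; \Sigma_M^2 \le C^2A,\; \frac{\eta G^2}{\lambda B^k} < A \le \frac{4MG^2}{\lambda^2}\right] \\
\le{} & \sum_{i=1}^{n}\Pr\left[\sum_{t=1}^{M} Z_t^k \ge 2\sqrt{C^2A\tau} + \frac{4}{3}\left(\frac{C^2}{\theta} + \frac{\theta A}{4}\right)\tau,\; \max_t |Z_t^k| \le \frac{C^2}{\theta} + \frac{\theta A}{4},\; \Sigma_M^2 \le C^2A,\; \frac{\eta G^2}{\lambda B^k}2^{i-1} < A \le \frac{\eta G^2}{\lambda B^k}2^{i}\right] \\
\le{} & \sum_{i=1}^{n}\Pr\left[\sum_{t=1}^{M} Z_t^k \ge \sqrt{2C^2\frac{\eta G^2 2^i}{\lambda B^k}\tau} + \frac{\sqrt{2}}{3}\left(\frac{C^2}{\theta} + \frac{\theta}{4}\frac{\eta G^2 2^i}{\lambda B^k}\right)\tau,\; \max_t |Z_t^k| \le \frac{C^2}{\theta} + \frac{\theta}{4}\frac{\eta G^2 2^i}{\lambda B^k},\; \Sigma_M^2 \le C^2\frac{\eta G^2 2^i}{\lambda B^k}\right] \\
\le{} & ne^{-\tau},
\end{aligned}$$

where

$$n = \left\lceil \log_2\frac{4MB^k}{\eta\lambda} \right\rceil,$$

and the last step follows the Bernstein inequality for martingales in Theorem 3. Setting

$$\theta = \frac{3\lambda}{4\tau}, \quad \text{and} \quad \tau = \log\frac{4n}{\delta},$$

with a probability at least $1 - \delta/4$ we have

$$\begin{aligned}
\sum_{t=1}^{M} Z_t^k
&\le 2\sqrt{C^2A\tau} + \frac{4}{3}\left(\frac{C^2}{\theta} + \frac{\theta A}{4}\right)\tau = 2\sqrt{C^2A\tau} + \frac{16C^2\tau^2}{9\lambda} + \frac{\lambda A}{4} \\
&\le \frac{4}{\lambda}C^2\tau + \frac{\lambda A}{4} + \frac{16C^2\tau^2}{9\lambda} + \frac{\lambda A}{4} = \frac{4C^2}{\lambda}\left(\log\frac{4n}{\delta} + \frac{4}{9}\log^2\frac{4n}{\delta}\right) + \frac{\lambda A}{2}.
\end{aligned} \tag{18}$$

We complete the proof by combining (17) and (18).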